Results 1 - 20 of 93
1.
J Chem Inf Model ; 64(8): 3205-3212, 2024 Apr 22.
Article in English | MEDLINE | ID: mdl-38544337

ABSTRACT

Language models trained on domain-specific corpora have been employed to increase performance in specialized tasks. However, little previous work has been reported on how specific a "domain-specific" corpus should be. Here, we test a number of language models trained on varyingly specific corpora by employing them in the task of extracting information on photocatalytic water splitting. We find that more specific corpora can benefit performance on downstream tasks. Furthermore, PhotocatalysisBERT, a model pretrained from scratch on scientific papers on photocatalytic water splitting, demonstrates improved performance over previous work in associating the correct photocatalyst with the correct photocatalytic activity during information extraction, achieving a precision of 60.8(+11.5)% and a recall of 37.2(+4.5)%.


Subject(s)
Photochemical Processes , Water , Water/chemistry , Catalysis
2.
J Chem Inf Model ; 64(4): 1187-1200, 2024 Feb 26.
Article in English | MEDLINE | ID: mdl-38320103

ABSTRACT

Machine learning (ML) methods can train a model to predict material properties by exploiting patterns in materials databases that arise from structure-property relationships. However, the importance of ML-based feature analysis and selection is often neglected when creating such models. Such analysis and selection are especially important when dealing with multifidelity data because they afford a complex feature space. This work shows how a gradient-boosted statistical feature-selection workflow can be used to train predictive models that classify materials by their metallicity and predict their band gap against experimental measurements, as well as computational data that are derived from electronic-structure calculations. These models are fine-tuned via Bayesian optimization, using solely the features that are derived from chemical compositions of the materials data. We test these models against experimental, computational, and a combination of experimental and computational data. We find that the multifidelity modeling option can reduce the number of features required to train a model. The performance of our workflow is benchmarked against state-of-the-art algorithms, the results of which demonstrate that our approach is either comparable to or superior to them. The classification model realized an accuracy score of 0.943, a macro-averaged F1-score of 0.940, an area under the receiver operating characteristic curve of 0.985, and an average precision of 0.977, while the regression model achieved a mean absolute error of 0.246, a root-mean-squared error of 0.402, and an R² of 0.937. This illustrates the efficacy of our modeling approach and highlights the importance of thorough feature analysis and judicious selection over a "black-box" approach to feature engineering in ML-based modeling.


Subject(s)
Algorithms , Machine Learning , Bayes Theorem , Workflow , Databases, Factual
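
The regression figures quoted in this abstract (mean absolute error, root-mean-squared error, R²) follow standard definitions. The Python sketch below is not taken from the paper; it only shows how the three quantities are computed. The band-gap values are made up for illustration, and the function name `regression_metrics` is an assumption.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for two equal-length sequences of numbers."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)          # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2


# Hypothetical band gaps in eV (illustrative values, not data from the paper).
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]
mae, rmse, r2 = regression_metrics(measured, predicted)
print(f"MAE={mae:.3f} eV, RMSE={rmse:.3f} eV, R2={r2:.3f}")
```

R² is undefined when all measured values are identical (the total sum of squares is zero), so a production version would need to guard against that case.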
3.
J Chem Inf Model ; 64(5): 1486-1501, 2024 Mar 11.
Article in English | MEDLINE | ID: mdl-38422386

ABSTRACT

Molecular design depends heavily on optical properties for applications such as solar cells and polymer-based batteries. Accurate prediction of these properties is essential, and multiple predictive methods exist, from ab initio to data-driven techniques. Although theoretical methods, such as time-dependent density functional theory (TD-DFT) calculations, have well-established physical relevance and are among the most popular methods in computational physics and chemistry, they exhibit errors that are inherent in their approximate nature. These high-throughput electronic structure calculations also incur a substantial computational cost. With the emergence of big-data initiatives, cost-effective, data-driven methods have gained traction, although their usability is highly contingent on the degree of data quality and sparsity. In this study, we present a workflow that employs deep residual convolutional neural networks (DR-CNN) and gradient boosting feature selection to predict peak optical absorption wavelengths (λmax) exclusively from SMILES representations of dye molecules and solvents; one would normally measure λmax using UV-vis absorption spectroscopy. We use a multifidelity modeling approach, integrating 34,893 DFT calculations and 26,395 experimentally derived λmax data, to deliver more accurate predictions via a Bayesian-optimized gradient boosting machine. Our approach is benchmarked against the state of the art that is reported in the scientific literature; results demonstrate that learnt representations via a DR-CNN workflow that is integrated with other machine learning methods can accelerate the design of molecules for specific optical characteristics.


Subject(s)
Machine Learning , Neural Networks, Computer , Bayes Theorem , Density Functional Theory , Spectrum Analysis
4.
Sci Data ; 11(1): 80, 2024 Jan 17.
Article in English | MEDLINE | ID: mdl-38233439

ABSTRACT

A database of thermally activated delayed fluorescent (TADF) molecules was automatically generated from the scientific literature. It consists of 25,482 data records with an overall precision of 82%. Among these, 5,349 records have chemical names in the form of SMILES strings which are represented with 91% accuracy; these are grouped in a subsidiary database. Each data record contains one of the following four properties: maximum emission wavelength (λEM), photoluminescence quantum yield (PLQY), singlet-triplet energy splitting (ΔEST), and delayed lifetime (τD). The databases were created through text mining using ChemDataExtractor, a chemistry-aware natural-language-processing toolkit, which has been adapted for TADF research. The text-mined corpus consisted of 2,733 papers from the Royal Society of Chemistry and Elsevier. To the best of our knowledge, these databases are the first databases that have been auto-generated for TADF molecules from existing publications. The databases have been publicly released for experimental and computational applications in the TADF research field.

5.
J Chem Phys ; 159(19), 2023 Nov 21.
Article in English | MEDLINE | ID: mdl-37971034

ABSTRACT

With the emergence of big data initiatives and the wealth of available chemical data, data-driven approaches are becoming a vital component of materials discovery pipelines or workflows. The screening of materials using machine-learning models, in particular, is increasingly gaining momentum to accelerate the discovery of new materials. However, the black-box treatment of machine-learning methods suffers from a lack of model interpretability, as feature relevance and interactions can be overlooked or disregarded. In addition, naive approaches to model training often lead to irrelevant features being used, which necessitates various regularization techniques to achieve model generalization; this incurs a high computational cost. We present a feature-selection workflow that overcomes this problem by leveraging a gradient boosting framework and statistical feature analyses to identify a subset of features, in a recursive manner, which maximizes their relevance to the target variable or classes. We subsequently obtain minimal feature redundancy through multicollinearity reduction by performing feature correlation and hierarchical cluster analyses. The features are further refined using a wrapper method, which follows a greedy search approach by evaluating all possible feature combinations against the evaluation criterion. A case study on elastic material-property prediction and a case study on the classification of materials by their metallicity are used to illustrate our proposed workflow, although it is highly general, as demonstrated through our wider subsequent prediction of various material properties. Our Bayesian-optimized machine-learning models generated results, without the use of regularization techniques, which are comparable to the state of the art reported in the scientific literature.
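
The multicollinearity-reduction step can be illustrated with a deliberately simplified Python sketch. It is not the authors' implementation: it replaces the hierarchical cluster analysis with a plain pairwise Pearson-correlation threshold, and it assumes a relevance ranking has already been produced (for instance by a gradient-boosting model). The feature names, the values, and the 0.9 threshold are all hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def drop_correlated(features, ranked_names, threshold=0.9):
    """Walk features in relevance order; keep one only if it is not
    strongly correlated with any feature that was already kept."""
    kept = []
    for name in ranked_names:
        if all(abs(pearson(features[name], features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical composition-derived features (illustrative values only).
features = {
    "mean_atomic_mass": [1.0, 2.0, 3.0, 4.0],
    "mean_atomic_number": [2.0, 4.1, 5.9, 8.0],   # nearly collinear with the first
    "electronegativity_range": [0.3, 0.1, 0.4, 0.2],
}
# Assumed relevance ranking, most relevant first.
ranked = ["mean_atomic_mass", "mean_atomic_number", "electronegativity_range"]
selected = drop_correlated(features, ranked)
print(selected)
```

Constant-valued features would cause a division by zero in `pearson`, so they should be removed beforehand.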

6.
J Chem Inf Model ; 63(22): 7045-7055, 2023 Nov 27.
Article in English | MEDLINE | ID: mdl-37934697

ABSTRACT

The ever-growing amount of chemical data found in the scientific literature has led to the emergence of data-driven materials discovery. The first step in the pipeline, to automatically extract chemical information from plain text, has been driven by the development of software toolkits such as ChemDataExtractor. Such data extraction processes have created a demand for parsers that efficiently enable text mining. Here, we present Snowball 2.0, a sentence parser based on a semisupervised machine-learning algorithm. It can be used to extract any chemical property without additional training. We validate its precision, recall, and F-score by training and testing a model with sentences of semiconductor band gap information curated from journal articles. Snowball 2.0 builds on two previously developed Snowball algorithms. Evaluation of Snowball 2.0 shows a 15-20% increase in recall with marginally reduced precision over the previous version which has been incorporated into ChemDataExtractor 2.0, giving Snowball 2.0 better performance in most configurations. Snowball 2.0 offers more and better parsing options for ChemDataExtractor, and it is more capable in the pipeline of automated data extraction. Snowball 2.0 also features better generalizability, performance, learning efficiencies, and user-friendliness.


Subject(s)
Algorithms , Software , Language , Data Mining , Supervised Machine Learning
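
Precision, recall, and F-score recur throughout these abstracts as figures of merit for text-mined records. The sketch below shows the standard definitions in Python, under the assumption that extracted records are compared with a hand-curated gold set by exact match. The records are illustrative placeholders and the function name is an assumption; none of this comes from Snowball 2.0 itself.

```python
def precision_recall_f1(extracted, gold):
    """Score extracted records against a gold-standard set by exact match."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative (compound, band gap) records; not data from any of the papers.
gold = {("GaAs", "1.42 eV"), ("Si", "1.12 eV"), ("ZnO", "3.37 eV"), ("GaN", "3.4 eV")}
extracted = {("GaAs", "1.42 eV"), ("Si", "1.12 eV"), ("Ge", "0.9 eV")}

precision, recall, f1 = precision_recall_f1(extracted, gold)
print(f"precision={precision:.3f}, recall={recall:.3f}, F1={f1:.3f}")
```

Two of the three extracted records are correct (precision 2/3), and two of the four gold records are found (recall 1/2), which is the same trade-off the abstract describes when recall rises at a small cost in precision.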
7.
Sci Data ; 10(1): 651, 2023 Sep 22.
Article in English | MEDLINE | ID: mdl-37739960

ABSTRACT

We present an automatically generated dataset of 15,755 records that were extracted from 47,357 papers. These records contain water-splitting activity in the presence of certain photocatalysts, along with additional information about the chemical reaction conditions under which this activity was recorded. These conditions include any co-catalysts and additives that were present during water splitting, the length of time for which the photocatalytic experiment was conducted, and the type of light source used, including its wavelength. Despite the text extraction of such a wide range of chemical reaction attributes, the dataset afforded good precision (71.2%) and recall (36.3%). These figures-of-merit were calculated based on a random sample of open-access papers from the corpus. Mining such a complex set of attributes required the development of novel techniques in knowledge extraction and interdependency resolution, leveraging inter- and intra-sentence relations, which are also described in this paper. We present a new version (version 2.2) of the chemistry-aware text-mining toolkit ChemDataExtractor, in which these new techniques are included.

8.
J Chem Inf Model ; 63(19): 6053-6067, 2023 Oct 09.
Article in English | MEDLINE | ID: mdl-37729111

ABSTRACT

Knowledge in the chemical domain is often disseminated graphically via chemical reaction schemes. The task of describing chemical transformations is greatly simplified by introducing reaction schemes that are composed of chemical diagrams and symbols. While intuitively understood by any chemist, like most graphical representations, such drawings are not easily understood by machines; this poses a challenge in the context of data extraction. Currently available tools are limited in their scope of extraction and require manual preprocessing, thus slowing down the speed of data extraction. We present a new tool, ReactionDataExtractor v2.0, which uses a combination of neural networks and symbolic artificial intelligence to effectively remove this barrier. We have evaluated our tool on a test set composed of reaction schemes that were taken from open-source journal articles and realized F1 score metrics between 75 and 96%. These evaluation metrics can be further improved by tuning our object-detection models to a specific chemical subdomain thanks to a data-driven approach that we have adopted with synthetically generated data. The system architecture of our tool is modular, which allows it to balance speed and accuracy to afford an autonomous, high-throughput solution for image-based chemical data extraction.


Subject(s)
Deep Learning , Artificial Intelligence , Neural Networks, Computer
9.
Chem Sci ; 14(13): 3600-3609, 2023 Mar 29.
Article in English | MEDLINE | ID: mdl-37006683

ABSTRACT

Infrared spectroscopy is a ubiquitous technique used to characterize unknown materials in the form of solids, liquids, or gases by identifying the constituent functional groups of molecules through the analysis of obtained spectra. The conventional method of spectral interpretation demands the expertise of a trained spectroscopist as it is tedious and prone to error, particularly for complex molecules which have poor representation in the literature. Herein, we present a novel method for automatically identifying functional groups in molecules given the corresponding infrared spectra, which requires no recourse to database-searching, rule-based, or peak-matching methods. Our model employs convolutional neural networks, trained and tested on 50,936 infrared spectra and 30,611 unique molecules, that are capable of successfully classifying 37 functional groups. Our approach demonstrates its practical relevance in the autonomous analytical identification of functional groups in organic molecules from infrared spectra.

10.
J Chem Inf Model ; 63(7): 1961-1981, 2023 Apr 10.
Article in English | MEDLINE | ID: mdl-36940385

ABSTRACT

Text mining in the optical-materials domain is becoming increasingly important as the number of scientific publications in this area grows rapidly. Language models such as Bidirectional Encoder Representations from Transformers (BERT) have opened up a new era and brought a significant boost to state-of-the-art natural-language-processing (NLP) tasks. In this paper, we present two "materials-aware" text-based language models for optical research, OpticalBERT and OpticalPureBERT, which are trained on a large corpus of scientific literature in the optical-materials domain. These two models outperform BERT and previous state-of-the-art models in a variety of text-mining tasks about optical materials. We also release the first "materials-aware" table-based language model, OpticalTable-SQA. This is a querying facility that solicits answers to questions about optical materials using tabular information that pertains to this scientific domain. The OpticalTable-SQA model was realized by fine-tuning the Tapas-SQA model using a manually annotated OpticalTableQA data set which was curated specifically for this work. While preserving its sequential question-answering performance on general tables, the OpticalTable-SQA model significantly outperforms Tapas-SQA on optical-materials-related tables. All models and data sets are available to the optical-materials-science community.


Subject(s)
Data Mining , Electric Power Supplies , Language , Materials Science , Natural Language Processing
11.
Inorg Chem ; 62(1): 318-335, 2023 Jan 09.
Article in English | MEDLINE | ID: mdl-36541860

ABSTRACT

Contemporary electrocatalysts for the reduction of CO2 often suffer from low stability, activity, and selectivity, or a combination thereof. Mn-carbonyl complexes represent a promising class of molecular electrocatalysts for the reduction of CO2 to CO as they are able to promote this reaction at relatively mild overpotentials, whereby rare-earth metals are not required. The electronic and geometric structure of the reaction center of these molecular electrocatalysts is precisely known and can be tuned via ligand modifications. However, ligand characteristics that are required to achieve high catalytic turnover at minimal overpotential remain unclear. We consider 55 Mn-carbonyl complexes, which have previously been synthesized and characterized experimentally. Four intermediates were identified that are common across all catalytic mechanisms proposed for Mn-carbonyl complexes, and their structures were used to calculate descriptors for each of the 55 Mn-carbonyl complexes. These electronic-structure-based descriptors encompass the binding energies, the highest occupied and lowest unoccupied molecular orbitals, and partial charges. Trends in turnover frequency and overpotential with these descriptors were analyzed to afford meaningful physical insights into what ligand characteristics lead to good catalytic performance, and how this is affected by the reaction conditions. These insights can be expected to significantly contribute to the rational design of more active Mn-carbonyl electrocatalysts.

12.
Chem Sci ; 13(39): 11487-11495, 2022 Oct 12.
Article in English | MEDLINE | ID: mdl-36348711

ABSTRACT

Due to the massive growth of scientific publications, literature mining is becoming increasingly popular for researchers to thoroughly explore scientific text and extract such data to create new databases or augment existing databases. Efforts in literature-mining software design and implementation have improved text-mining productivity, but most of the toolkits that mine text are based on traditional machine-learning algorithms, which hinder the performance of downstream text-mining tasks. Natural-language processing (NLP) and text-mining technologies have seen rapid development since the release of transformer models, such as bidirectional encoder representations from transformers (BERT). Upgrading rule-based or machine-learning-based literature-mining toolkits by embedding transformer models into the software is therefore likely to improve their text-mining performance. To this end, we release a Python-based literature-mining toolkit for the field of battery materials, BatteryDataExtractor, which involves the embedding of BatteryBERT models in its automated data-extraction pipeline. This pipeline employs BERT models for token-classification tasks, such as abbreviation detection, part-of-speech tagging, and chemical-named-entity recognition, as well as new double-turn question-answering data-extraction models for auto-generating repositories of inter-related material and property data as well as general information. We demonstrate that BatteryDataExtractor exhibits state-of-the-art performance on the evaluation data sets for both token classification and automated data extraction. To aid the use of BatteryDataExtractor, its code is provided as open-source software, with associated documentation to serve as a user guide.

13.
Sci Data ; 9(1): 648, 2022 Oct 22.
Article in English | MEDLINE | ID: mdl-36272983

ABSTRACT

An auto-generated thermoelectric-materials database is presented, containing 22,805 data records, automatically generated from the scientific literature, spanning 10,641 unique extracted chemical names. Each record contains a chemical entity and one of the seminal thermoelectric properties: thermoelectric figure of merit, ZT; thermal conductivity, κ; Seebeck coefficient, S; electrical conductivity, σ; power factor, PF; each linked to their corresponding recorded temperature, T. The database was auto-generated using the automatic sentence-parsing capabilities of the chemistry-aware, natural language processing toolkit, ChemDataExtractor 2.0, adapted for application in the thermoelectric-materials domain, following a rule-based sentence-simplification step. Data were mined from the text of 60,843 scientific papers that were sourced from three scientific publishers: Elsevier, the Royal Society of Chemistry, and Springer. To the best of our knowledge, this is the first automatically-generated database of thermoelectric materials and their properties from existing literature. The database was evaluated to have a precision of 82.25% and has been made publicly available to facilitate the application of data science in the thermoelectric-materials domain, for analysis, design, and prediction.
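
The thermoelectric properties listed in this abstract are linked by two standard relations, PF = S²σ and ZT = S²σT/κ. The Python sketch below is a worked example of those textbook formulas; it is not code from the paper or from ChemDataExtractor. The function names and input values are hypothetical, and SI units are assumed throughout.

```python
def power_factor(seebeck, conductivity):
    """PF = S^2 * sigma, with S in V/K and sigma in S/m (result in W m^-1 K^-2)."""
    return seebeck ** 2 * conductivity


def figure_of_merit(seebeck, conductivity, thermal_conductivity, temperature):
    """ZT = S^2 * sigma * T / kappa (dimensionless), with kappa in W m^-1 K^-1 and T in K."""
    return power_factor(seebeck, conductivity) * temperature / thermal_conductivity


# Hypothetical material: S = 200 microvolts/K, sigma = 1e5 S/m, kappa = 1.5 W/(m K), T = 300 K.
pf = power_factor(200e-6, 1.0e5)
zt = figure_of_merit(200e-6, 1.0e5, 1.5, 300.0)
print(f"PF = {pf:.2e} W m^-1 K^-2, ZT = {zt:.2f}")
```

Because each database record holds only one property at one temperature, a calculation like this requires joining records for the same compound at matching temperatures first.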

14.
Nat Chem ; 14(9): 973-975, 2022 Sep.
Article in English | MEDLINE | ID: mdl-36008602
15.
RSC Adv ; 12(26): 16656-16662, 2022 Jun 01.
Article in English | MEDLINE | ID: mdl-35754871

ABSTRACT

We outline procedures to calculate small-angle scattering (SAS) intensity functions from 2-dimensional electron-microscopy (EM) images. Two types of scattering systems were considered: (a) the sample is a set of particles confined to a plane; or (b) the sample is modelled as parallel, infinitely long cylinders that extend into the image plane. In each case, an EM image is segmented into particle instances and the background, whereby coordinates and morphological parameters are computed and used to calculate the constituents of the SAS-intensity function. We compare our results with experimental SAS data, discuss limitations, both general and case specific, and outline some applications of this method which could potentially complement experimental SAS.

16.
Sci Data ; 9(1): 329, 2022 Jun 17.
Article in English | MEDLINE | ID: mdl-35715446

ABSTRACT

The number of scientific publications reporting cutting-edge third-generation photovoltaic devices is increasing rapidly, owing to the pressing need to develop renewable-energy technologies that address the climate-change crisis. Consequently, the field could benefit from a central repository where photovoltaic-performance metrics, such as the power-conversion efficiency (η) are recorded. We present two automatically generated databases that contain photovoltaic properties and device material data for dye-sensitized solar cells (DSCs) and perovskite solar cells (PSCs), totalling 660,881 data entries representing 57,678 photovoltaic devices. The databases were generated by applying the text-mining toolkit ChemDataExtractor on a corpus of 25,720 articles. A multi-faceted evaluation, incorporating manual and automatic methods, was applied to ensure that the data contained therein were of the highest quality, with precision metrics ranging from 73.1% to 95.8%. The DSC database contains 475,045 entries representing 41,680 devices, and the PSC database contains 185,836 entries representing 15,818 devices. The databases are available in MongoDB and JSON formats, which can be queried in Python, R, Java and MATLAB for data-driven photovoltaic materials discovery.
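
Since the abstract states that the JSON release can be queried in Python, a minimal sketch of such a query follows. The record layout is an assumption: the field names `device_type`, `dye`, `perovskite`, and `efficiency_percent` are hypothetical and may not match the published schema. The records themselves are made-up placeholders, embedded inline so that the example is self-contained.

```python
import json

# Placeholder records with a hypothetical schema; not data from the database.
raw = """
[
  {"device_type": "DSC", "dye": "dye_A", "efficiency_percent": 5.2},
  {"device_type": "DSC", "dye": "dye_B", "efficiency_percent": 9.8},
  {"device_type": "PSC", "perovskite": "perovskite_A", "efficiency_percent": 18.4}
]
"""
records = json.loads(raw)


def filter_by_efficiency(records, device_type, min_efficiency):
    """Return records of one device type at or above an efficiency threshold."""
    return [
        r for r in records
        if r.get("device_type") == device_type
        and r.get("efficiency_percent", 0.0) >= min_efficiency
    ]


efficient_dscs = filter_by_efficiency(records, "DSC", 8.0)
print([r["dye"] for r in efficient_dscs])
```

With the real files, `json.load` on an open file handle would replace the inline string, and the field names would need to be checked against the released documentation.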

17.
J Phys Chem C Nanomater Interfaces ; 126(13): 6047-6059, 2022 Apr 07.
Article in English | MEDLINE | ID: mdl-35573119

ABSTRACT

Recent discoveries of a range of single-crystal optical actuators are feeding a new form of materials chemistry, given their broad range of potential applications, from light-induced molecular motors to light sensors and optical-memory media. A series of ruthenium-based coordination complexes that exhibit sulfur dioxide linkage photoisomerization is of particular interest because they exhibit single-crystal optical actuation via either optical switching or nano-optomechanical transduction processes. We report the discovery of a new complex in this series of chemicals, [Ru(SO2)(NH3)4(3-fluoropyridine)]tosylate2 (1), which forms an η1-OSO photoisomer with 70% photoconversion upon the application of 505 nm light. The uncoordinated oxygen atom in this η1-OSO photoisomer impinges on one of the arene rings in a neighboring tosylate counter ion of 1 just enough that incipient nano-optomechanical transduction is observed. The structure and optical properties of this actuator are characterized via in situ light-induced single-crystal X-ray diffraction (photocrystallography), single-crystal optical absorption spectroscopy and microscopy, as well as single-crystal Raman spectroscopy. These materials-characterization methods were also used to track thermally induced reverse isomerization processes in 1. One of these processes involves an η1-OSO to η2-(OS)O transition, which was found to proceed sufficiently slowly at 110 K that its structural mechanism could be determined via a time sequence of photocrystallography experiments. The resulting data allowed us to structurally capture the transition, which was shown to occur via a form of coordination isomerism. Our newfound knowledge about this structural mechanism will aid the molecular design of new [RuSO2] complexes with functional applications.

18.
J Chem Inf Model ; 62(24): 6365-6377, 2022 Dec 26.
Article in English | MEDLINE | ID: mdl-35533012

ABSTRACT

A great number of scientific papers are published every year in the field of battery research, which forms a huge textual data source. However, it is difficult to explore and retrieve useful information efficiently from these large unstructured sets of text. The Bidirectional Encoder Representations from Transformers (BERT) model, trained on a large data set in an unsupervised way, provides a route to process the scientific text automatically with minimal human effort. To this end, we realized six battery-related BERT models, namely, BatteryBERT, BatteryOnlyBERT, and BatterySciBERT, each of which consists of both cased and uncased models. They have been trained specifically on a corpus of battery research papers. The pretrained BatteryBERT models were then fine-tuned on downstream tasks, including battery paper classification and extractive question-answering for battery device component classification that distinguishes anode, cathode, and electrolyte materials. Our BatteryBERT models were found to outperform the original BERT models on the specific battery tasks. The fine-tuned BatteryBERT was then used to perform battery database enhancement. We also provide a website application for its interactive use and visualization.


Subject(s)
Electric Power Supplies , Language , Humans , Databases, Factual , Natural Language Processing
19.
Sci Data ; 9(1): 193, 2022 May 03.
Article in English | MEDLINE | ID: mdl-35504897

ABSTRACT

Large-scale databases of band gap information about semiconductors that are curated from the scientific literature have significant usefulness for computational databases and general semiconductor materials research. This work presents an auto-generated database of 100,236 semiconductor band gap records, extracted from 128,776 journal articles with their associated temperature information. The database was produced using ChemDataExtractor version 2.0, a 'chemistry-aware' software toolkit that uses Natural Language Processing (NLP) and machine-learning methods to extract chemical data from scientific documents. The modified Snowball algorithm of ChemDataExtractor has been extended to incorporate nested models, optimized by hyperparameter analysis, and used together with the default NLP parsers to achieve optimal quality of the database. Evaluation of the database shows a weighted precision of 84% and a weighted recall of 65%. To the best of our knowledge, this is the largest open-source non-computational band gap database to date. Database records are available in CSV, JSON, and MongoDB formats, which are machine readable and can assist data mining and semiconductor materials discovery.

20.
Sci Data ; 9(1): 192, 2022 May 03.
Article in English | MEDLINE | ID: mdl-35504964

ABSTRACT

The ability to auto-generate databases of optical properties holds great potential for advancing optical research, especially with regards to the data-driven discovery of optical materials. An optical property database of refractive indices and dielectric constants is presented, which comprises a total of 49,076 refractive index and 60,804 dielectric constant data records on 11,054 unique chemicals. The database was auto-generated using the state-of-the-art natural language processing software, ChemDataExtractor, using a corpus of 388,461 scientific papers. The data repository offers a representative overview of the information on linear optical properties that resides in scientific papers from the past 30 years. Public availability of these data will enable a quick search for the optical property of certain materials. The large size of this repository will accelerate data-driven research on the design and prediction of optical materials and their properties. To the best of our knowledge, this is the first auto-generated database of optical properties from a large number of scientific papers. We provide a web interface to aid the use of our database.
